Extracting Chemical Information from Thai Unstructured Text with Unknown Phrase Boundaries

Author

  • Peerasak Intarapaiboon
Abstract

Due to the limitations of language-processing tools for the Thai language, pattern-based information extraction from Thai documents requires supplementary techniques. Based on sliding-window rule application and extraction filtering, we present a framework for extracting multi-slot frames describing chemical reactions and those describing chemical syntheses from Thai unstructured text with unknown target-phrase boundaries. A supervised rule learning algorithm is employed for automatic construction of pattern-based extraction rules from hand-tagged training phrases. A filtering method is devised for removal of incorrect extraction results based on features observed from text portions appearing between adjacent slot fillers in source documents. The experimental results show that the filtering components improve precision while preserving recall satisfactorily.
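The two stages named in the abstract, sliding-window rule application followed by filtering on the text between adjacent slot fillers, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the English toy tokens, the `CHEMICALS` lexicon, the `yields` trigger word, the single hand-written rule (the paper learns its rules from hand-tagged phrases), the window size, and the two gap features used by the filter.

```python
from typing import NamedTuple

class Frame(NamedTuple):
    reactant: str
    product: str
    gap: tuple  # tokens lying between the two slot fillers

# Hypothetical lexicon and trigger word; the paper's rules are learned, not hand-written.
CHEMICALS = {"Na", "NaCl", "H2", "O2", "H2O"}
TRIGGER = "yields"

def apply_rule(window, offset):
    """Toy rule: CHEMICAL ... TRIGGER ... CHEMICAL, all inside one window.
    Returns (i, j) token positions of the two slot fillers in the full text."""
    hits = []
    for i, tok in enumerate(window):
        if tok not in CHEMICALS:
            continue
        for j in range(i + 2, len(window)):
            if window[j] in CHEMICALS and TRIGGER in window[i + 1:j]:
                hits.append((offset + i, offset + j))
    return hits

def extract(tokens, window_size=6):
    """Slide a fixed-size window over the tokens, since phrase boundaries are unknown."""
    spans = set()  # a set removes duplicates found by overlapping windows
    for start in range(0, max(1, len(tokens) - window_size + 1)):
        spans.update(apply_rule(tokens[start:start + window_size], start))
    return [Frame(tokens[i], tokens[j], tuple(tokens[i + 1:j]))
            for i, j in sorted(spans)]

def keep(frame, max_gap=3):
    """Assumed filter features: gap length and presence of a sentence break."""
    return len(frame.gap) <= max_gap and "." not in frame.gap

tokens = "Na yields NaCl . H2 and O2 yields H2O".split()
raw = extract(tokens)
kept = [f for f in raw if keep(f)]
for f in kept:
    print(f.reactant, "->", f.product)
```

In this toy run the unfiltered extraction contains a spurious frame pairing `Na` with `H2` across the sentence break; the gap-based filter drops it and leaves the other frames untouched, which is the precision-versus-recall trade-off the abstract describes.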


Similar resources

A Model for Extracting Information from Text Documents, Based on Text Mining in the E-Learning Domain

As computer networks become the backbone of science and the economy, enormous quantities of documents become available. Text mining techniques are therefore used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts, or new hypotheses by automatically extracting information from different written documents. T...


Determining the Boundaries and Types of Syntactic Phrases in Persian Texts

Text tokenization is the process of splitting text into meaningful tokens such as words, phrases, and sentences. Tokenization into syntactic phrases, known as chunking, is an important preprocessing step in many applications such as machine translation, information retrieval, and text-to-speech. In this paper, chunking of Farsi texts is done using statistical and learning methods and the grammat...


Standardization of Unstructured Textual Data into Semantic Web Format

Analyses of the nature of the data posted on the World Wide Web (WWW) reveal that more than 80% of the data on the WWW is in unstructured text format. Hence, extracting information from text is of paramount importance for both academic and business purposes. Simultaneously, the evolution of web technology led to the novel concept of the Semantic Web, which is an extension of the current web in wh...


A Collaborative Framework for Collecting Thai Unknown Words from the Web

We propose a collaborative framework for collecting Thai unknown words found on Web pages over the Internet. Our main goal is to design and construct a Web-based system that allows a group of interested users to participate in constructing an open dictionary of Thai unknown words. The proposed framework provides supporting algorithms and tools for automatically identifying and extracting unknown wor...


Automatic Thai Keyword Extraction from Categorized Text Corpus

Information Extraction (IE) is a process of discovering implicit and potentially important keywords underlying an unstructured natural-language text corpus. Most previously proposed solutions to IE were accomplished by constructing a set of words from the given text corpus during the preprocessing step. Due to the inherent characteristic of the Thai written language, which does not explicitly use any word d...




Publication date: 2012